Risk factors for high CAD-RADS scoring in CAD patients revealed by machine learning methods: a retrospective study

Objective This study aimed to evaluate a variety of machine learning (ML) methods for predicting coronary artery disease-reporting and data system (CAD-RADS) scores from cardiovascular risk factors. Methods This is a retrospective cohort study. Demographic characteristics, cardiovascular risk factors and coronary CT angiography (CCTA) characteristics of the patients were obtained. Coronary artery disease (CAD) was evaluated using the CAD-RADS score. The stenosis severity component of CAD-RADS was stratified into two groups: a CAD-RADS score 0-2 group and a CAD-RADS score 3-5 group. CAD-RADS scores were predicted with random forest (RF), k-nearest neighbors (KNN), support vector machines (SVM), neural network (NN), decision tree classification (DTC) and linear discriminant analysis (LDA). Prediction sensitivity, specificity, accuracy and area under the curve (AUC) were calculated. Feature importance analysis was used to find the most important predictors. Results A total of 442 CAD patients with CCTA examinations were included in this study. Of these, 234 (52.9%) subjects were in the CAD-RADS score 0-2 group and 208 (47.1%) were in the CAD-RADS score 3-5 group. The CAD-RADS score 3-5 group had a high prevalence of hypertension (66.8%), hyperlipidemia (50%) and diabetes mellitus (DM) (35.1%). Age, systolic blood pressure (SBP), mean arterial pressure, pulse pressure, pulse pressure index, plasma fibrinogen, uric acid and blood urea nitrogen were significantly higher (p < 0.001), and high-density lipoprotein cholesterol (HDL-C) was lower (p < 0.001), in the CAD-RADS score 3-5 group than in the CAD-RADS score 0-2 group. Nineteen features were chosen to train the models. RF (AUC = 0.832) and LDA (AUC = 0.81) outperformed SVM (AUC = 0.772), NN (AUC = 0.773), DTC (AUC = 0.682) and KNN (AUC = 0.707). Feature importance analysis indicated that plasma fibrinogen, age and DM contributed most to CAD-RADS scores.
Conclusion ML algorithms are capable of predicting the correlation between cardiovascular risk factors and CAD-RADS scores with high accuracy.


INTRODUCTION
CAD is the leading cause of morbidity and mortality worldwide (Akella & Akella, 2021; Popa et al., 2020; Saharan et al., 2021) and is one of the major contributors to healthcare costs in China. The pathogenesis of CAD is complex and is affected by a variety of risk factors, with atherosclerosis being the most common underlying cause of cardiovascular diseases (Popa et al., 2020). Multiple conventional risk factors augment the atherosclerotic process, including age, sex, smoking, hypertension, hyperlipidemia, DM, hyperuricemia, coagulation abnormalities, obesity, insulin resistance, C-reactive protein levels, plasma fibrinogen, and others (Giacco & Brownlee, 2010; Kazemian et al., 2020; Shariatnia et al., 2022; Song et al., 2015; Tsai, Chiang & Huang, 2020; Velusamy & Ramasamy, 2021; Williams et al., 2018; Yang et al., 2018). Understanding and properly quantifying the etiological contribution of these risk factors is essential for devising and improving preventive strategies for CAD. In 2016, the Society of Cardiovascular Computed Tomography (SCCT), the American College of Radiology (ACR), and the North American Society for Cardiovascular Imaging (NASCI) published the CAD-RADS, a standardized method to assess CAD using CCTA (Rubinshtein & Hamdan, 2020). Although there have been numerous studies on CAD risk prediction, the relationship between traditional risk factors and CAD-RADS in Chinese patients evaluated by CCTA remains understudied, and the impact of CAD-RADS on management and outcome is still unknown (Foldyna et al., 2018), even though risk assessment is crucial for reducing the worldwide burden of CAD.
Machine learning (ML) methods have been developed to predict outcomes in cardiovascular disease and have the potential to provide useful insights for cardiovascular medicine (Khalaji et al., 2022; Li et al., 2022). ML underlies most artificial intelligence (AI) technologies in the medical research setting and includes various algorithms for prediction and classification tasks that perform well on complex big data (Kagiyama et al., 2019). These algorithms have emerged as valuable tools for predicting patient outcomes from pertinent feature variables and have already been applied to identify unknown CAD risk factors, automate imaging interpretation, and enhance clinical decision-making, thus facilitating precision medicine (Huang et al., 2022; Panteris et al., 2022; Saravi et al., 2022). Some of the most widely used mathematical methods for prediction are discriminant analysis, logistic regression, neural networks, and classification and regression trees (Shariatnia et al., 2022). With supervised learning, the strongest predictors can be selected to train a model to predict outcomes (Khalaji et al., 2022). Although ML has been applied in the literature to predict CAD-RADS scores, few studies have evaluated commonly used clinical risk factors as predictors of these scores (Muscogiuri et al., 2020). This study hypothesizes that ML algorithms have the potential to accurately predict CAD-RADS scores based on the most significant cardiovascular risk factors.

Study Design and data collection
Participants were selected from patients who visited the cardiology department of the First or the Second Affiliated Hospitals of University of South China between October 2017 and December 2022. Demographic and clinical data, including information on clinical risk factors, were collected retrospectively. The time interval between the collection of clinical risk factors and CCTA data was two weeks. Of the 579 participants who underwent a CCTA scan, 442 subjects were included in the study after excluding those with missing or unsatisfactory CCTA data for analysis (n = 45), incomplete basic clinical information (n = 38), or a history of bypass surgery or percutaneous coronary intervention (PCI) (n = 54) (Fig. 1). The study was approved by the human ethics review board of University of South China (2022020587), and all patients provided written informed consent. Inclusion criteria for the study were as follows: all patients had a free heart rate and cardiac rhythm variation of ≤5 beats/min and no obvious contraindications. Exclusion criteria were a history of valvular heart disease, bypass surgery or PCI, severe arrhythmia, and failure to cooperate during inspection. All coronary segments with a diameter greater than 1.5 mm were evaluated according to the Expert Consensus Document (Cury et al., 2016).
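The cohort flow described above can be checked with simple arithmetic. The short Python sketch below only restates the counts given in the text; the dictionary keys are illustrative labels, not names used by the authors.

```python
# Cohort flow as reported in the text: 579 scanned, three exclusion groups.
scanned = 579
exclusions = {
    "missing_or_unsatisfactory_ccta": 45,   # illustrative label
    "incomplete_clinical_information": 38,  # illustrative label
    "prior_bypass_or_pci": 54,              # illustrative label
}

# Patients remaining after all exclusions are applied.
included = scanned - sum(exclusions.values())
print(included)
```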

Cardiovascular risk assessment
Demographic variables and traditional CAD risk factors included age, gender, SBP, diastolic blood pressure, mean arterial pressure, pulse pressure, pulse pressure index, hypertension, DM, smoking status, hyperlipidemia, total cholesterol, triglycerides, HDL-C, low-density lipoprotein cholesterol (LDL-C), uric acid, plasma fibrinogen, blood creatinine, and blood urea nitrogen. Patients who were smokers at the time of analysis were classified as current smokers. Hypertension was defined as SBP ≥140 mmHg and/or diastolic blood pressure ≥90 mmHg, or use of antihypertensive medication (Williams et al., 2018). DM was defined as fasting serum glucose ≥126 mg/dL (7.0 mmol/L), or a 2-hour value in the oral glucose tolerance test ≥200 mg/dL (11.1 mmol/L), or a hemoglobin A1c level ≥6.5%. Hyperlipidemia was defined as fasting serum total cholesterol level ≥2.3 mmol/L (220 mg/dL) and/or fasting serum triglyceride level ≥150 mg/dL and/or the use of antihyperlipidemic agents. High LDL-C was defined as LDL-C ≥2.6 mmol/L (100 mg/dL) for the first time, while low HDL-C was defined as HDL-C <1.0 mmol/L (40 mg/dL) (Zhou et al., 2022).
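The hypertension and DM definitions above translate directly into threshold rules. The following Python sketch is illustrative only: the function and parameter names are hypothetical, and the formulas for pulse pressure, mean arterial pressure and pulse pressure index are the conventional ones, assumed here because the text does not state how these indices were derived.

```python
def has_hypertension(sbp, dbp, on_antihypertensives=False):
    """SBP >= 140 mmHg and/or DBP >= 90 mmHg, or antihypertensive medication."""
    return sbp >= 140 or dbp >= 90 or on_antihypertensives


def has_diabetes(fasting_glucose_mmol=None, ogtt_2h_mmol=None, hba1c_pct=None):
    """Any one of the three criteria in the text; unmeasured values are skipped."""
    return any([
        fasting_glucose_mmol is not None and fasting_glucose_mmol >= 7.0,
        ogtt_2h_mmol is not None and ogtt_2h_mmol >= 11.1,
        hba1c_pct is not None and hba1c_pct >= 6.5,
    ])


def pressure_indices(sbp, dbp):
    """ASSUMPTION: conventional formulas, not taken from the text."""
    pulse_pressure = sbp - dbp
    return {
        "pulse_pressure": pulse_pressure,
        "mean_arterial_pressure": dbp + pulse_pressure / 3,
        "pulse_pressure_index": pulse_pressure / sbp,
    }


print(has_hypertension(150, 80), has_diabetes(hba1c_pct=6.8))
print(pressure_indices(120, 80))
```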

CCTA scan protocol
All CCTA scans were performed using two 256-slice multidetector CT scanners (Brilliance iCT 256 from Philips and SOMATOM Definition Flash CT from Siemens). The scanning parameters were as follows: tube voltage of 120 kV, 800 mAs, slices/collimation of 128/0.625 mm, gantry rotation time of 330 ms, pitch of 0.2, effective slice thickness of 0.9 mm, and reconstruction increment of 0.45 mm. Patients with a heart rate > 80 beats/min were given oral beta-blockers 1 h prior to the examination. All patients received 0.5 mg of sublingual nitroglycerin for coronary vasodilatation. A bolus of 1.5 ml/kg of iodinated contrast medium was administered intravenously at a rate of 5 ml/s, followed by 40 ml of saline injected at the same rate. After acquisition, the images were processed using artificial intelligence (AI) software (V2.4.2; Shukun, Beijing, China).
The stenosis severity component of CAD-RADS was stratified into two groups for uniformity and sample size based on previously published methods: a CAD-RADS score 0-2 group and a CAD-RADS score 3-5 group (Laggoune et al., 2019; Popa et al., 2020). The CAD-RADS scores were generated by AI software (V2.4.2, Shukun). Figure 2 shows the different degrees of coronary artery stenosis in CCTA images.
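The two-group stratification amounts to a binary target variable. A minimal Python sketch follows, assuming the stenosis category is available as an integer from 0 to 5; the function name and the 0/1 coding are illustrative choices, not taken from the study.

```python
def cad_rads_group(score):
    """Map a CAD-RADS stenosis category (integer 0-5) to a binary group.

    Returns 0 for the CAD-RADS 0-2 group and 1 for the CAD-RADS 3-5 group.
    """
    if score not in range(6):
        raise ValueError(f"CAD-RADS category must be an integer 0-5, got {score!r}")
    return 1 if score >= 3 else 0


print([cad_rads_group(s) for s in range(6)])
```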

Test/train split and feature selection
The study population was randomly assigned to a training cohort, which comprised 70% of the patients, and a test cohort, which comprised the remaining 30%, in order to validate the predictive models (Gao et al., 2015; Khalaji et al., 2022). The training dataset was used to fit the models. The test dataset was then used to provide an unbiased evaluation of the final model fitted on the training dataset (Akella & Akella, 2021). Feature selection was performed using a technique known as "information gain attribute ranking" (Motwani et al., 2017). The most significant predictors were obtained from the random forest (RF) model prediction in the training data using 10-fold cross-validation (Khalaji et al., 2022). The dataset was partitioned into ten distinct subsets, with nine of them designated for training and one for evaluation. This process was repeated ten times, so that each subset served once as the evaluation set while the ten training sets overlapped.
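The 70/30 split and the 10-fold partition can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the authors' R/JASP procedure: the seed, the rounding of the training-set size, and the function names are all hypothetical.

```python
import random


def train_test_split_ids(ids, train_fraction=0.7, seed=0):
    """Shuffle patient ids and split them into training and test cohorts."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))  # ASSUMPTION: rounding rule
    return shuffled[:n_train], shuffled[n_train:]


def k_folds(ids, k=10):
    """Yield (training, validation) pairs; each fold is held out exactly once."""
    folds = [ids[i::k] for i in range(k)]
    for i in range(k):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, folds[i]


train_ids, test_ids = train_test_split_ids(range(442))
print(len(train_ids), len(test_ids))

for training, validation in k_folds(train_ids):
    # Nine folds train the model, the tenth evaluates it.
    assert not set(training) & set(validation)
```

Cross-validation is run inside the training cohort only, so the test cohort stays untouched until the final evaluation.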

Model development and performance evaluation
To develop predictive models, we used six ML methods: random forest (RF), support vector machine (SVM), neural network (NN), k-nearest neighbor (KNN), decision tree classification (DTC) and linear discriminant analysis (LDA). All models were implemented in the statistical software packages R and JASP and built using k-fold (k = 10) cross-validation. We tuned the hyperparameters of each model using the grid search method to increase prediction accuracy. The training set was used to learn the model parameters, while the test set was used to compute standard evaluation metrics.
Each model was trained and tested for CAD-RADS scores.
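Grid search with cross-validation can be illustrated with a toy k-nearest-neighbor classifier written in plain Python. Everything in this sketch is an assumption made for illustration: the synthetic one-dimensional data, the candidate grid, and the function names. It is not the tuning code used in the study, which was run in R and JASP.

```python
import random


def knn_predict(train, point, k):
    """Majority vote among the k nearest training points (1-D feature)."""
    nearest = sorted(train, key=lambda row: abs(row[0] - point))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def cv_accuracy(data, k_neighbors, n_folds=10):
    """Mean accuracy over n_folds held-out folds."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    correct = 0
    for i, held_out in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        correct += sum(knn_predict(train, x, k_neighbors) == y for x, y in held_out)
    return correct / len(data)


def grid_search(data, grid):
    """Score every candidate value and return the best one with all scores."""
    scores = {k: cv_accuracy(data, k) for k in grid}
    return max(scores, key=scores.get), scores


rng = random.Random(0)
# Synthetic data: class 1 tends to have larger feature values than class 0.
data = [(rng.gauss(2.0 * label, 1.0), label) for label in [0, 1] * 50]
rng.shuffle(data)

best_k, scores = grid_search(data, grid=[1, 3, 5, 7, 9])
print(best_k, scores)
```

Odd values of k are used so that the majority vote cannot tie in a two-class problem.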

Statistical analysis
Statistical analysis was performed using SPSS software (V25.0; SPSS Inc., Chicago, IL, USA). Baseline characteristics are presented as mean ± standard deviation (SD) or frequencies and percentages. Categorical variables were compared using the chi-square or Fisher's exact tests, while continuous variables were analyzed with the independent samples t-test. Prior to analysis, we assessed the normality of data distributions and the homogeneity of variance of continuous variables. Whenever the distribution of continuous data was not normal, the Mann-Whitney U-test was used for comparison, and results were presented as median (interquartile range, IQR). A p value < 0.05 was considered statistically significant. The six models were built in the statistical software packages R and JASP, and their performance was subsequently compared to determine the optimal classifier for identifying the risk factors that best predict CAD-RADS scores.
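As an illustration of the categorical comparison, the Pearson chi-square statistic for a 2x2 table can be computed by hand. The Python sketch below uses made-up counts, applies no continuity correction, and obtains the p value from the one-degree-of-freedom identity p = erfc(sqrt(chi2 / 2)); SPSS output may differ if a correction is applied.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# HYPOTHETICAL counts: rows are groups, columns are risk factor present/absent.
chi2, p_value = chi_square_2x2([[30, 20], [10, 40]])
print(round(chi2, 3), p_value < 0.05)
```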

Baseline characteristics of the study population
A total of 442 CAD patients were included in this cohort, with 268 (60.6%) males and 174 (39.4%) females. The median age was 63 years, with the lowest and highest ages being 18 and 88 years, respectively. Among the entire cohort, there was a high prevalence of hypertension (51.6%) and hyperlipidemia (42.8%). One hundred and forty-two (32.1%) people had a history of smoking and 91 (20.6%) had DM. All subjects were divided into two groups based on CAD-RADS scores: 234 (52.9%) subjects were in the CAD-RADS score 0-2 group and 208 (47.1%) were in the CAD-RADS score 3-5 group. The CAD-RADS score 3-5 group had a higher prevalence of hypertension (66.8%), hyperlipidemia (50%), and DM (35.1%). Age, SBP, mean arterial pressure, pulse pressure, pulse pressure index, plasma fibrinogen, uric acid, and blood urea nitrogen were significantly higher (p < 0.001), and HDL-C was lower (p < 0.001), in the CAD-RADS score 3-5 group than in the CAD-RADS score 0-2 group (Table 1). There were significant differences in hyperlipidemia, triglycerides, and serum creatinine between the two groups. However, our results did not reveal any association between diastolic blood pressure (p = 0.052), total cholesterol (p = 0.265), or LDL-C (p = 0.572) and CAD-RADS group. Table 1 presents the univariate analysis of the association between cardiovascular risk factors and CAD classified using CAD-RADS.
Table 1 notes: Chi-square tests were performed on gender, hypertension, diabetes, smoking, and hyperlipidemia. Pulse pressure index and HDL-C were compared with Student's t-test. Age, systolic blood pressure, diastolic blood pressure, mean arterial pressure, pulse pressure, total cholesterol, triglycerides, plasma fibrinogen, uric acid, creatinine and blood urea nitrogen were analyzed with the Mann-Whitney U-test because they did not follow a normal distribution, and are presented as median (IQR). LDL-C was compared with the Welch test because the variances were not homogeneous.

Model evaluation
We applied six ML algorithms to the test dataset. Table 2 compares the predictive values of the different models in terms of AUC, accuracy, sensitivity, and specificity. All the models performed better than chance (AUC > 0.6) in predicting CAD-RADS scores. The RF and LDA models showed the best discrimination, with AUCs of 0.832 and 0.81, respectively. SVM and NN had an acceptable performance, and DTC showed the lowest discriminatory ability with an AUC of 0.682. After tuning for the threshold, the SVM model achieved the highest sensitivity and specificity, both at 0.772. RF and SVM showed the highest accuracy, both at 0.773. Figure 3 illustrates the ROC curves and AUCs for the six models.
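For readers less familiar with these metrics, the Python sketch below shows how sensitivity, specificity, accuracy and AUC are computed from true labels and predicted scores. The labels and scores are toy values chosen for illustration, and the AUC uses the pairwise-ranking (Mann-Whitney) formulation; none of this reproduces the study's Table 2.

```python
def confusion_metrics(y_true, y_score, threshold=0.5):
    """Sensitivity, specificity and accuracy at a given decision threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, y_score):
        predicted = 1 if s >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
    }


def auc(y_true, y_score):
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = CAD-RADS 3-5 group, 0 = CAD-RADS 0-2 group.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(confusion_metrics(labels, scores))
print(auc(labels, scores))
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes discrimination across all thresholds, which is why a model can lead on one metric and not on another.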

Result of feature importance
We employed the RF prediction model to rank all features based on their significance in test data, using k-fold cross-validation (k = 10). Figure 4 shows the ranking of the features used for model development. Nineteen features were chosen for predicting CAD-RADS scores. The feature importance analysis revealed that plasma fibrinogen was the most important feature for the classification task, followed by age and DM (Fig. 4).
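The Methods name "information gain attribute ranking" as the feature selection technique. A minimal Python sketch of information gain for a categorical feature is given below; the feature names and values are toy data, and the ranking the study reports came from its RF model in R/JASP, not from this code.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Reduction in label entropy after splitting on a categorical feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


# Toy data: 1 = CAD-RADS 3-5 group, 0 = CAD-RADS 0-2 group.
labels = [1, 1, 0, 0]
features = {
    "perfect_marker": [1, 1, 0, 0],        # HYPOTHETICAL feature
    "uninformative_marker": [1, 0, 1, 0],  # HYPOTHETICAL feature
}

ranking = sorted(features, key=lambda f: information_gain(features[f], labels), reverse=True)
print(ranking)
```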

DISCUSSION
CAD is a serious disease that affects both health and function. Identifying risk factors is crucial for preventing acute coronary events in CAD patients. While many risk factors have been proposed for CAD patients, few studies have investigated risk factors associated with CAD-RADS classification, and no effective systematic model has been proposed to predict whether a patient is at high risk of coronary heart disease. Previous studies investigating the application of AI in the diagnostic pathway of CAD have used different AI algorithms (Khalaji et al., 2022;Muscogiuri et al., 2020;Shariatnia et al., 2022). In this study, we investigated whether CAD-RADS scores could be predicted using ML algorithms based on risk factors data from CAD patients. Our cohort study utilized the cardiovascular medicine databank from two clinical research institutes, which contained diverse demographic information and can provide reliable data on patients with CAD.
CAD-RADS, as a powerful standardized reporting tool, may facilitate further research and provide a framework for standardized collection of CCTA reports across multiple sites for quality improvement and benchmarking (Cury et al., 2016). In CCTA interpretation, a proper assessment of the CAD extent, severity, and characteristics largely depends on the reader's clinical skills and experience. Despite proper CAD assessment, even experienced readers might misclassify cases due to a lack of knowledge of the CAD-RADS classification (Foldyna et al., 2018). In our study, images were post-processed using AI software. In the future, automated classification systems may combine image analysis and standardized reporting tools, leading to more reliable and faster CAD-RADS assessment (Foldyna et al., 2018). This would be especially valuable for patients with CAD-RADS scores of 3-5, because a CAD-RADS grade of 3 or greater suggests consideration of functional evaluation and anti-ischemic or preventative drugs (Cury et al., 2016; Huang et al., 2020; Muscogiuri et al., 2020; Rubinshtein & Hamdan, 2020).
Multiple ML algorithms can be utilized for feature importance analysis, with RF being a commonly employed method. RF can improve prediction accuracy without considerably increasing the computational cost, maintains high predictive performance, and is a very effective method for feature screening and classification. In a recent study, an RF model achieved a good AUC of 0.948 in identifying CAD patients from controls, exhibiting favorable predictive capability and clinical application value (Wang et al., 2021). In agreement with this finding, another study compared various ML models for the diagnosis of CAD, and its results showed that the RF predictive model achieved 92.04% accuracy and an area under the ROC curve of 92.20% and was identified as the best of the models compared (Muhammad et al., 2021). The utilization of RF in CAD has been highlighted in other literature as well (Liu et al., 2021; Saharan et al., 2021). In our study, RF showed the best predictive performance, with an AUC of 0.832, surpassing the other models. These findings are consistent with the existing literature on the potential applications of RF. Our findings also indicated that the LDA model demonstrated a predictive ability comparable to that of the RF model, with an AUC of 0.81. LDA has been recommended as a predictive model with excellent accuracy, sensitivity, and specificity in applications to cardiovascular diseases (Ricciardi et al., 2020; Shariatnia et al., 2022).
Plasma fibrinogen, as a coagulation index, was the most important feature based on our feature selector, and it was independently associated with coronary severity and complexity in patients with CAD. Plasma fibrinogen, a marker of inflammation and coagulation, may stimulate coagulation, platelet aggregation, and vascular endothelial dysfunction, mediate the transportation of adhesion molecules on the surface of the endothelium and their further migration to the intima, and trigger proliferation and migration of smooth-muscle cells to increase coronary plaque vulnerability (Loukas et al., 2002; Song et al., 2015; Tabakcıi et al., 2017). It is therefore a potentially suitable target in CAD. Many studies have examined the role of plasma fibrinogen levels alone in the prediction of CAD events. Song et al. reported that the plasma fibrinogen levels of CAD patients were 0.94-fold higher than those of the control group and showed a significant association between plasma fibrinogen level and CAD risk (Song et al., 2015). A meta-analysis confirmed that an increase in fibrinogen concentration of 1 g/L, depending on age and sex differences, was associated with a 2.42-fold risk of CAD (Danesh et al., 2005). Gąsior's study (Gasior et al., 2018) showed that, for patients with non-critical stenosis in coronary arteries, higher plasma fibrinogen concentration predisposed them to cardiovascular events, and that plasma fibrinogen was related to the frequency of revascularization. In a recent community-based cohort study in the Taiwanese population (Hsieh et al., 2022), a total of 2,222 participants who underwent plasma fibrinogen measurements and did not have CVD at baseline were recruited. The findings showed that participants with higher fibrinogen levels tended to have a higher risk of CAD, indicating that a high level of fibrinogen may be a risk factor for CAD.
These observations indicate that plasma fibrinogen is independently associated with coronary severity and complexity in patients with CAD. In agreement with these studies, our model suggested that plasma fibrinogen was a significant risk factor for high CAD-RADS scores: patients in the CAD-RADS score 3-5 group exhibited higher plasma fibrinogen levels than those in the CAD-RADS score 0-2 group (p < 0.001). These findings highlight the potential benefits of monitoring plasma fibrinogen concentrations in preventing CAD.
In our study, age was the second most important risk predictor for CAD-RADS scores. Similarly, age has been reported as the second most important risk factor for 5-year mortality in CAD patients undergoing PCI (Liu et al., 2021). Sun et al. (2012) demonstrated that the percentage of patients with significant coronary artery stenosis increased to 38% in patients aged over 65 years compared to less than 15% in patients under 56 years. Kim et al. (2021) used the RF model to define the relative importance of age in coronary plaque progression. They found that the rate of whole-heart plaque progression and dense calcification increases with age, and that age is as important as any other traditional cardiovascular risk factor. Our research demonstrated that DM was also a significant risk factor for high CAD-RADS scores. Similar to our finding, a previous study showed that patients with DM demonstrated more obstructive CAD on CCTA than patients without DM (Van den Hoogen et al., 2020). Both age and DM play crucial roles in plaque growth and the progression of coronary atherosclerosis, as evidenced by the size, volume, and density of coronary atherosclerotic plaque, which directly affect the degree of stenosis in the coronary artery lumen and thereby CAD-RADS scores.
In summary, our findings suggest that elevated plasma fibrinogen levels, advanced age, and DM were significant predictors of CAD-RADS 3-5 scores. Monitoring plasma fibrinogen and blood glucose levels may offer additional information for the prevention of CAD in clinical practice. Individuals with elevated levels of plasma fibrinogen or blood glucose, or of advanced age, should receive increased attention in CAD prevention efforts. In addition to plasma fibrinogen, age and DM, other variables such as pulse pressure, HDL-C, pulse pressure index, mean arterial pressure, SBP and smoking are also relatively significant in stratifying a patient's risk for CAD. These risk factors seldom occur independently but rather tend to cluster together with other cardiovascular risk factors. Modifying these risk factors may be effective in preventing CAD progression and reducing CAD-RADS scores. Further research with a larger sample size of Chinese patients with CAD is necessary to provide more conclusive evidence regarding these associations.

LIMITATIONS
This study has several limitations. Firstly, there are still some disadvantages to using ML in cardiovascular practice. There are many unmeasured or unknown important variables, and different classifiers for the same dataset may not all be equally robust (Chuah et al., 2022). Data availability also limits the generalizability of ML algorithms. The data used for training ML models are typically acquired from one or several laboratories, health centers, or hospitals (Shu, Ren & Song, 2021), so the outcomes may vary among diverse populations. As a result, external validation of the models is required. Secondly, the data were collected from only two hospitals, and the study population was relatively small. Another limitation was the absence of pathological confirmation of CAD severity and the cross-sectional design. Moreover, the data were collected retrospectively, which may lower the reliability of evidence compared to prospectively collected data. Lastly, we only considered 19 traditional risk factors for CAD. Future studies should include more variables to further validate our findings.

CONCLUSION
This study indicated that RF outperformed the other models in predicting CAD-RADS scores among CAD patients, making it a recommended predictive model for identifying high-risk patients with CAD-RADS 3-5 scores. The most important features were plasma fibrinogen, age and DM, indicating that combined strategies targeting these factors may be effective in reducing the burden of CAD. We hope this study can serve as a valuable resource for future research on this topic.